52 research outputs found

    A Similarity Measure for GPU Kernel Subgraph Matching

    Accelerator architectures specialize in executing SIMD (single instruction, multiple data) in lockstep. Because the majority of CUDA applications are parallelized loops, control flow information can provide an in-depth characterization of a kernel. CUDAflow is a tool that statically separates CUDA binaries into basic block regions and dynamically measures instruction and basic block frequencies. CUDAflow captures this information in a control flow graph (CFG) and performs subgraph matching across various kernels' CFGs to gain insights into an application's resource requirements, based on the shape and traversal of the graph, the instruction operations executed, and the registers allocated, among other information. The utility of CUDAflow is demonstrated with SHOC and Rodinia application case studies on a variety of GPU architectures, revealing novel thread divergence characteristics that help end users, autotuners, and compilers generate high-performing code.

    In-Vitro Anti-Fungal Activity and Phytochemical Screening of Stem Bark Extracts from Ventilago denticulata

    The objective of the present study was to assess the antifungal activity of petroleum ether, acetone, ethyl acetate, and ethanol extracts of the stem bark of Ventilago denticulata (VD). The material was dried in shade and ground to a coarse powder, and a weighed quantity of the powder (1000 g) was subjected to hot percolation in a Soxhlet apparatus using petroleum ether, ethyl acetate, acetone, and ethanol at a temperature range of 40-80 °C. Phytochemical tests were done for the presence of phytoconstituents such as glycosides, alkaloids, tannins, steroids, and flavonoids. The antifungal activity was assessed by the cup method using Sabouraud's agar as the medium. Plates were incubated at 25 °C for 42 hr and later observed for zones of inhibition. The effect of the extracts on fungal isolates was compared with griseofulvin at a concentration of 10 mg/ml. The ethyl acetate extract showed an antifungal effect at both low and high doses, with a zone of inhibition of 10 mm at the low dose (T1, 10 mg/ml) and 16 mm at the high dose (T2, 20 mg/ml). The petroleum ether, acetone, and ethanol extracts did not produce any antifungal effect at either dose. Keywords: Ventilago denticulata, antifungal, griseofulvin

    Performance analysis of GPU programming models using the roofline scaling trajectories

    Performance analysis is a daunting job, especially for rapidly evolving accelerator technologies. The Roofline Scaling Trajectories technique aims at diagnosing various performance bottlenecks for GPU programming models through visually intuitive Roofline plots. In this work, we introduce the use of Roofline Scaling Trajectories to capture major performance bottlenecks on NVIDIA Volta GPU architectures, such as warp efficiency, occupancy, and locality. Using this analysis technique, we explain the performance characteristics of the NAS Parallel Benchmarks (NPB) written with two programming models, CUDA and OpenACC. We present the influence of the programming model on the performance and scaling characteristics. We also leverage the insights of the Roofline Scaling Trajectory analysis to tune some of the NAS Parallel Benchmarks, achieving up to 2× speedup.

    Overview of Application Instrumentation for Performance Analysis and Tuning

    Profiling and tuning of parallel applications is an essential part of HPC. The hot spots of an application can be analyzed and improved using one of many available tools that measure resource consumption for each instrumented part of the code. Since complex applications behave differently in each part of the code, it is desirable to insert instrumentation that separates these parts. Besides manual instrumentation, some profiling libraries provide other ways of instrumenting code. Of these, binary patching is the most universal mechanism, and it greatly improves the user-friendliness and robustness of the tool. We provide an overview of the most commonly used binary patching tools and show a workflow for using them to implement a binary instrumentation tool for any profiler or autotuner. We have also evaluated the minimum overhead of manual and binary instrumentation.

    Analysis of the Jobs Resource Utilization on a Production System

    In the HPC community, the System Utilization metric makes it possible to determine whether the resources of a cluster are used efficiently by the batch scheduler. This metric assumes that all allocated resources (memory, disk, processors, etc.) are utilized full time. To optimize system performance, we have to consider the actual physical consumption of jobs relative to their resource allocations. This information gives an insight into whether the cluster resources are used efficiently by the jobs. In this work we propose an analysis of production clusters based on job resource utilization. The principle is to simultaneously collect traces from the job scheduler (provided by logs) and of the jobs' resource consumption. The latter has been realized by developing a job monitoring tool, whose impact on the system has been measured as lightweight (0.35% slowdown). The key point is to statistically analyze both traces to detect and explain underutilization of the resources. This could make it possible to detect abnormal behavior and bottlenecks in the cluster that lead to poor scalability, and to justify optimizations such as gang scheduling or best-effort scheduling. This method has been applied to two medium-sized production clusters over a period of eight months.

    Score-P and OMPT: Navigating the Perils of Callback-Driven Parallel Runtime Introspection

    Event-based performance analysis aims at modeling the behavior of parallel applications through a series of state transitions during execution. Different approaches to obtain such transition points for OpenMP programs include source-level instrumentation (e.g., OPARI) and callback-driven runtime support (e.g., OMPT). In this paper, we revisit a previous evaluation and comparison of OPARI and an LLVM OMPT implementation (now updated to the OpenMP 5.0 specification) in the context of Score-P. We describe the challenges faced while trying to use OMPT as a drop-in replacement for the existing instrumentation-based approach and the changes in event order that could not be avoided. Furthermore, we provide details on Score-P measurements using OPARI and OMPT as event sources with the EPCC and SPEC OpenMP benchmark suites.